Abstract

We consider classification and regression tasks where we have missing data and assume that the (clean) data
resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using
a non-proper formulation we give an efficient agnostic algorithm that classifies as well as the best linear classifier
coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm
can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the
full data.


1 Introduction 

Handling missing data correctly is a fundamental and classical challenge in machine learning. There
are many reasons why data might be missing. For example, in the medical domain, some data might be missing
because certain procedures were not performed on a given patient, other data might be missing because the patient
chose not to disclose them, and some data might even be missing due to the malfunction of certain equipment. While it
is definitely much better to always have complete and accurate data, this utopian desire is often not the reality.
For this reason we need to utilize the available data even if some of it is missing.

Another, very different motivation for missing data is recommendations. For example, a movie recommendation
dataset might contain users' opinions on certain movies, which is the case, for example, in the Netflix motion picture
dataset. Clearly, no user has seen or reviewed all movies, or even close to it. In this respect recommendation data is
an extreme case: the vast majority is usually missing (i.e., it is sparse to the extreme).

Often we can solve the missing data problem because the data resides on a lower-dimensional manifold. In the
above examples, if there are prototypical users (or patients) and any user is a mixture of the prototypical users, then
this implicitly suggests that the data is low rank. Another way to formalize this assumption is to consider the data in
matrix form, say, the users are rows and the movies are columns; our assumption is then that the true complete matrix has
a low rank.

Our starting point is to consider the low rank assumption, but to avoid any explicit matrix completion, and instead
dive directly into the classification problem. At the end of the introduction we show that matrix completion is neither
sufficient nor necessary.

We consider perhaps the most fundamental data analysis technique of the machine learning toolkit: linear (and
kernel) classification, as applied to data where some (or even most) of the attributes in an example might be missing.
Our main result is an efficient algorithm for linear and kernel classification that performs as well as the best
classifier that has access to all data, under a low rank assumption with natural non-degeneracy conditions.

We stress that our result is worst case: we do not assume that the missing data follows any probabilistic rule other
than the underlying matrix having low rank. This is in clear contrast to most existing matrix completion algorithms. We
also cast our results in a distributional setting, showing that the classification error that we achieve is close to the best
classification using the subspace of the examples (and with no missing data). Notably, many variants of the problem
of finding a hidden subspace are computationally hard (see e.g. Berthet & Rigollet (2013)), yet as we show, learning
a linear classifier on a hidden subspace is non-properly learnable.

At a high level, we assume that a sample is a triplet $(x, o, y)$, where $x \in \mathbb{R}^d$ is the complete example,
$o \subseteq \{1,\dots,d\}$ is the set of observable attributes and $y \in \mathcal{Y}$ is the label. The learner observes only
$(x_o, y)$, where $x_o$ omits any attribute not in $o$. Our goal is, given a sample $S = \{(x^{(i)}_{o_i}, y_i)\}_{i=1}^m$, to output
a classifier $h_S$ such that w.h.p.:

\[ \mathbb{E}[\ell(h_S(x_o), y)] \le \min_{\|w\| \le 1} \mathbb{E}[\ell(w \cdot x, y)] + \epsilon, \]

where $\ell$ is the loss function. Namely, we would like our classifier $h_S$ to compete with the best linear classifier for the
completely observable data.

Our main result is achieving this task (under mild regularity conditions) using a computationally efficient algorithm
for any convex Lipschitz-bounded loss function. Our basic result requires a sample size which is quasi-polynomial,
but we complement it with a kernel construction which can guarantee efficient learning under appropriate large margin
assumptions. Our kernel depends only on the intersection of observable values of two inputs, and is efficiently
computable. (We give a more detailed overview of our main results in Section 2.)

Preliminary experimental evidence indicates our theoretical contributions lead to promising classification performance
both on synthetic data and on publicly-available recommendation data. This will be detailed in the full version
of this paper.


Previous work. Classification with missing data is a well studied subject in statistics with numerous books and
papers devoted to its study (see, e.g., Little & Rubin (2002)). The statistical treatment of missing data is broad, and to
a fairly large extent assumes parametric models both for the data generating process as well as the process that creates
the missing data. One of the most popular models for the missing data process is Missing Completely at Random
(MCAR) where the missing attributes are selected independently from the values.

We outline a few of the main approaches to handling missing data in the statistics literature. The simplest method
is simply to discard records with missing data; even this assumes independence between the examples with missing
values and their labels. In order to estimate simple statistics, such as the expected value of an attribute, one can use
importance sampling methods, where the probability of an attribute being missing can depend on its value (e.g., using
the Horvitz-Thompson estimator, Horvitz & Thompson (1952)). A large body of techniques is devoted to imputation
procedures which complete the missing data. This can be done by replacing a missing attribute by its mean (mean
imputation), or using a regression based on the observed value (regression imputation), or sampling the other examples
to complete the missing value (hot deck).* The imputation methodologies share a similar goal with matrix completion,
namely reducing the problem to one with complete data; however, their methodologies and motivating scenarios are
very different. Finally, one can build a complete Bayesian model for both the observed and unobserved data and
use it to perform inference. As with almost any Bayesian methodology, its success depends largely on selecting the
right model and prior, and this is even ignoring the computational issues which make inference in many of those models
computationally intractable.

In the machine learning community, missing data was considered in the framework of limited attribute observability
Ben-David & Dichterman (1998) and its many refinements Dekel et al. (2010); Cesa-Bianchi et al. (2010, 2011);
Hazan & Koren (2012). However, to the best of our knowledge, the low-rank property is not captured by previous
work, nor is the extreme amount of missing data. More importantly, much of the research is focused on selecting
which attributes to observe or on missing attributes at test or train time (see also Eban et al. (2014); Globerson &
Roweis (2006)). In our case the learner has no control over which attributes are observable in an example and the domain
is fixed. The latter case is captured in the work of Chechik et al. (2008), who rescale inner-products according to the
amount of missing data. Their method, however, does not entail theoretical guarantees on reconstruction in the worst
case, and gives rise to non-convex programs.

A natural and intuitive methodology to follow is to treat the labels (both known and unknown) as an additional
column in the data matrix and complete the data using a matrix completion algorithm, thereby obtaining the classification.
Indeed, exactly this was proposed by Goldberg et al. (2010). Although this is a natural approach, we show that
completion is neither necessary nor sufficient for classification. Furthermore, the techniques for provably completing
a low rank matrix are only known under probabilistic models with restricted distributions Srebro (2004); Candes &
Recht (2009); Lee et al. (2010); Salakhutdinov & Srebro (2010); Shamir & Shalev-Shwartz (2011). The only
non-probabilistic matrix completion algorithm in the online learning setting we are aware of is Hazan et al. (2012), which
we were not able to use for our purposes.

* We remark that our model implicitly includes the mean-imputation or 0-imputation methods and therefore will always outperform them.


Is matrix completion sufficient and/or necessary? We demonstrate that classification with missing data is provably
different from matrix completion. We start by considering a learner that tries to complete the missing entries
in an unsupervised manner and then performs classification on the completed data. This approach is closely akin to
imputation techniques, generative models and any other two-step (unsupervised/supervised) algorithm. Our example
shows that even under realizable assumptions, such an algorithm may fail. We then proceed to analyze the approach
previously mentioned: treating the labels as an additional column.

To see that unsupervised completion is insufficient for prediction, consider the example in Figure 1: the original
data is represented by filled red and green dots and it is linearly separable. Each data point will have one of its
two coordinates missing (this can even be done at random). In the figure the arrow from each instance points to the
observed attribute. However, a rank-one completion by projection onto the pink hyperplane is possible, and admits no
separation. The problem is clearly that the mapping to a low dimension is independent of the labels, and therefore
we should not expect that properties that depend on the labels, such as linear separability, will be maintained.



Figure 1: Linearly separable data, for which certain completions make the data non-separable.


Next, consider a learner that treats the labels as an additional column. Goldberg et al. (2010) considered the
following problem:

\[
\begin{array}{ll}
\text{minimize}_{Z} & \mathrm{rank}(Z) \\
\text{subject to:} & Z_{i,j} = M_{i,j}, \quad (i,j) \in \Omega,
\end{array}
\tag{G}
\]

where $\Omega$ is the set of observed attributes (or observed labels for the corresponding columns). Now assume that we
always see one of the following examples: $[1, *, 1, *]$, $[*, -1, *, -1]$, or $[1, -1, 1, -1]$. The observed labels
are respectively $1$, $-1$ and $1$. A typical data matrix with one test point might be of the form:

\[
M = \left[ \begin{array}{cccc|c}
1 & * & 1 & * & 1 \\
* & -1 & * & -1 & -1 \\
1 & -1 & 1 & -1 & 1 \\
1 & * & 1 & * & *
\end{array} \right]
\]


First note that there is no rank-1 completion of this matrix. On the other hand, we will show that there is more than one
rank-2 completion, each leading to a different classification of the test point. The first possible completion is to complete
the odd columns to a constant $1$ vector, and the even columns to a constant $-1$ vector. Then complete the labeling
whichever way you choose. Clearly there is no hope for this completion to lead to any meaningful result, as the label
vector is independent of the data columns. On the other hand, we may complete the first and last rows to a constant $1$
vector, and the second row to a constant $-1$ vector. All possible completions lead to an optimal solution w.r.t. Problem
(G) but have different outcomes w.r.t. classification. We stress that this is not a sample complexity issue. Even if we
observe an abundant amount of data, the completion task is still ill-posed.
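The ambiguity can be checked numerically. The following is a minimal Python sketch, assuming the data matrix consists of the three example patterns above plus one test row observed as $[1, *, 1, *]$ with an unknown label; the helper `rank` and the variable names are illustrative and not part of the paper. It builds the two completions described in the text and confirms that both have rank 2 while assigning opposite labels to the test point.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Last column holds the labels; the last row is the test point.
# Completion A: odd columns all 1, even columns all -1, test label set to -1.
completion_a = [
    [1, -1, 1, -1, 1],
    [1, -1, 1, -1, -1],
    [1, -1, 1, -1, 1],
    [1, -1, 1, -1, -1],
]
# Completion B: first and last rows constant 1, second row constant -1.
completion_b = [
    [1, 1, 1, 1, 1],
    [-1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1],
    [1, 1, 1, 1, 1],
]

print(rank(completion_a), rank(completion_b))      # both are rank 2
print(completion_a[3][4], completion_b[3][4])      # opposite test labels
```

Both matrices agree with every observed entry, so a rank-minimizing solver has no basis for preferring one test label over the other.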
Finally, matrix completion is also not necessary for prediction. Consider a movie recommendation dataset with two
separate populations, French and Chinese, where each population reviews a different set of movies. Even if each
population has a low rank, performing successful matrix completion in this case is impossible (and intuitively it does
not make sense in such a setting). However, linear classification in this case is possible via a single linear classifier,
for example by setting all non-observed entries to zero. For a numerical example, return to the matrix $M$ above.
Note that we observe only three instances, hence the classification task is easy but doesn't lead to reconstruction of the
missing entries.


2 Problem Setup and Main Result 

We begin by presenting the general setting: a vector with missing entries can be modeled as a tuple $(x, o)$, where
$x \in \mathbb{R}^d$ and $o \in 2^d$ is a subset of indices. The vector $x$ represents the full data and the set $o$ represents the observed
attributes. Given such a tuple, let us denote by $x_o$ a vector in $(\mathbb{R} \cup \{*\})^d$ such that

\[
(x_o)_i = \begin{cases} x_i & i \in o \\ * & \text{else} \end{cases}
\]

The task of learning a linear classifier with missing data is to return a target function over $x_o$ that competes with the
best linear classifier over $x$. Specifically, a sequence of triplets $\{(x^{(i)}, o_i, y_i)\}_{i=1}^m$ is drawn iid according to some
distribution $D$. An algorithm is provided with the sample $S = \{(x^{(i)}_{o_i}, y_i)\}_{i=1}^m$ and should return a target function $f_S$
over missing data such that w.h.p.:

\[
\mathbb{E}[\ell(f_S(x_o), y)] \le \min_{w \in B_d(r)} \mathbb{E}[\ell(w \cdot x, y)] + \epsilon, \tag{2}
\]

where $\ell$ is the loss function and $B_d(r)$ denotes the Euclidean ball in dimension $d$ of radius $\sqrt{r}$. For brevity, we will
say that a target function $f_S$ is $\epsilon$-good if Eq. (2) holds.

Without any assumptions on the distribution $D$, the task is ill-posed. One can construct examples where the
learner over missing data doesn't have enough information to compete with the best linear classifier. Such is the case
when, e.g., there is some attribute that is constantly concealed and independent of all other features. Therefore, certain
assumptions on the distribution must be made.

One reasonable assumption is that the marginal distribution $D$ over $x$ is supported on a low-dimensional
linear subspace $E$ and that for every set of observations, we can linearly reconstruct the vector $x$ from the vector
$P_o x$, where $P_o \in \mathbb{R}^{|o| \times d}$ is the projection on the observed attributes. In other words, we demand that the mapping
$P_o|_E : E \to P_o E$, which is the restriction of $P_o$ to $E$, is full-rank. As the learner doesn't have access to the subspace
$E$, the learning task is still far from trivial.

We give a precise definition of the last assumption in Assumption 1. Though our results hold under the low rank
assumption, the convergence rates we give depend on a certain regularity parameter. Roughly, we parametrize the
"distance" of $P_o|_E$ from singularity, and our results will quantitatively depend on this distance. Again, we defer all
rigorous definitions to Section 3.2.


Our first result is an upper bound on the sample complexity of the problem. We then proceed to a more general
statement that entails an efficient kernel-based algorithm.


2.1 Main Result 


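Theorem 1 (Main Result). Assume that $\ell$ is an $L$-Lipschitz convex loss function. Let $D$ be a $\lambda$-regular distribution (see
Definition 1). Let $\gamma(\epsilon) \ge \frac{\log (2L/\lambda\epsilon)}{\lambda}$ and

\[
\Gamma(\epsilon) = \frac{d^{\gamma(\epsilon)+1} - d}{d - 1}.
\]

There exists an algorithm (independent of $D$) that receives a sample $S = \{(x^{(i)}_{o_i}, y_i)\}_{i=1}^m$ of size
$m \in \Omega\left(\frac{L^2 \Gamma(\epsilon)^2 \log 1/\delta}{\epsilon^2}\right)$
and returns a target function $f_S$ that is $\epsilon$-good with probability at least $(1 - \delta)$. The algorithm runs in time $\mathrm{poly}(|S|)$.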
Theorem 1 gives an upper bound on the computational and sample complexity of learning a linear classifier with
missing data under the low rank assumption. As the sample complexity is quasipolynomial, this has limited practical
value in many situations. However, as the next theorem states, $f_S$ can actually be computed by applying a kernel trick.
Thus, under further large margin assumptions we can significantly improve performance.
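To get a feel for the quasi-polynomial growth, the two quantities in Theorem 1 can be evaluated directly. The following is a minimal Python sketch, assuming the threshold $\gamma(\epsilon) \ge \log(2L/\lambda\epsilon)/\lambda$ and $\Gamma(\epsilon) = \sum_{k=1}^{\gamma(\epsilon)} d^k$; the function names and the example values of $d$, $L$, $\lambda$, $\epsilon$ are illustrative assumptions, not taken from the paper.

```python
import math

def gamma_eps(L, lam, eps):
    """Smallest integer gamma with gamma >= log(2L / (lam * eps)) / lam."""
    return max(1, math.ceil(math.log(2 * L / (lam * eps)) / lam))

def big_gamma(d, gamma):
    """Embedding dimension Gamma = d + d^2 + ... + d^gamma."""
    return sum(d ** k for k in range(1, gamma + 1))

# Hypothetical problem sizes, chosen only for illustration.
d, L, lam, eps = 10, 1.0, 0.5, 0.1
g = gamma_eps(L, lam, eps)
print(g, big_gamma(d, g))
```

Since $\Gamma(\epsilon)$ grows like $d^{\gamma(\epsilon)}$ and $\gamma(\epsilon)$ grows like $\log(1/\epsilon)/\lambda$, the explicit dimension is quickly out of reach, which is what motivates the kernel formulation.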

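Theorem 2. For every $\gamma > 0$, there exists an embedding over missing data

\[ \phi_\gamma : x_o \to \mathbb{R}^{\Gamma} \]

such that $\Gamma = \sum_{k=1}^{\gamma} d^k = \frac{d^{\gamma+1} - d}{d - 1}$ and the scalar product between two samples $\phi_\gamma(x^{(1)}_{o_1})$ and $\phi_\gamma(x^{(2)}_{o_2})$ can be
efficiently computed; specifically, it is given by the formula:

\[
k_\gamma(x^{(1)}_{o_1}, x^{(2)}_{o_2}) := \frac{|o_1 \cap o_2|^{\gamma} - 1}{|o_1 \cap o_2| - 1} \sum_{i \in o_1 \cap o_2} x^{(1)}_i \cdot x^{(2)}_i .
\]

In addition, let $\ell$ be an $L$-Lipschitz loss function and $S = \{x^{(i)}_{o_i}\}_{i=1}^m$ a sample drawn iid according to a distribution $D$.
We make the assumption that $\|P_o x\| \le 1$ a.s. The following hold:

1. At each iteration of Alg. 1 we can efficiently compute $v_t^\top \phi_\gamma(x^{(t)}_{o_t})$ for any new example $x^{(t)}_{o_t}$. Specifically it is
given by the formula

\[
v_t^\top \phi_\gamma(x^{(t)}_{o_t}) := \sum_{i=1}^{t-1} \alpha^{(t)}_i \, k_\gamma(x^{(i)}_{o_i}, x^{(t)}_{o_t}).
\]

Hence Alg. 1 runs in $\mathrm{poly}(T)$ time and sequentially produces target functions $f_t(x_o) = v_t^\top \phi_\gamma(x_o)$ that can be
computed at test time in $\mathrm{poly}(T)$ time.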

2. Run Alg. pt = Y’ P ~ ^ ^ “ T then with probability (1 — 5).' 


^IIv|P + -5]£C 


m 


2 = 1 


< min-||v|| 


C 

m 


l:wv7,(x‘,),..)]+6(mtiZ£).(3, 


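3. For any $\epsilon > 0$, if $D$ is a $\lambda$-regular distribution and $\gamma \ge \frac{\log (2L/\lambda\epsilon)}{\lambda}$ then for some $v^* \in B_\Gamma(\Gamma)$

\[
\mathbb{E}[\ell(v^* \cdot \phi_\gamma(x_o), y)] \le \min_{w \in B_d(1)} \mathbb{E}[\ell(w \cdot x, y)] + \epsilon.
\]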

To summarize, Theorem 2 states that we can embed the sample points with missing attributes in a high dimensional,
finite, Hilbert space of dimension $\Gamma$, such that:

• The scalar product between embedded points can be computed efficiently. Hence, due to the conventional
representer argument, the task of empirical risk minimization is tractable.

• Following the conventional analysis of kernel methods: under large margin assumptions in the ambient space,
we can compute a predictor with scalable sample complexity and computational efficiency.

• Finally, the best linear predictor over embedded sample points in a $\sqrt{\Gamma}$-ball is comparable to the best linear
predictor over fully observed data.

Taken together, we can learn a predictor with sample complexity $\Omega\left(\frac{\Gamma^2(\epsilon)}{\epsilon^2} \log \frac{1}{\delta}\right)$ and Theorem 1 holds.
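The kernel of Theorem 2 needs only the commonly observed coordinates of the two inputs. The following is a minimal Python sketch, assuming each partially observed vector is stored as a dict from observed index to value; this representation and the name `karma_kernel` are illustrative choices, not the paper's.

```python
def karma_kernel(x1, x2, gamma):
    """k_gamma for two partially observed vectors.

    x1, x2: dicts mapping an observed index to its value (hypothetical format).
    """
    common = x1.keys() & x2.keys()
    n = len(common)
    dot = sum(x1[i] * x2[i] for i in common)
    # Geometric sum 1 + n + ... + n^(gamma-1), i.e. (n^gamma - 1)/(n - 1),
    # which equals gamma when n == 1 (the limit convention).
    factor = sum(n ** j for j in range(gamma))
    return factor * dot

x1 = {0: 1.0, 2: 2.0, 3: -1.0}
x2 = {2: 0.5, 3: 4.0, 5: 1.0}
print(karma_kernel(x1, x2, gamma=3))   # uses only indices 2 and 3
print(karma_kernel(x1, x2, gamma=1))   # plain dot product of zero-imputed vectors
```

With `gamma=1` the factor is 1 and the kernel reduces to the scalar product obtained after filling missing entries with zeros.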

For completeness we present the method together with an efficient algorithm that optimizes the RHS of Eq. (3) via
an SGD method. The optimization analysis is derived in a straightforward manner from the work of Shalev-Shwartz
et al. (2011). Other optimization algorithms exist in the literature, and we chose this optimization method as it allows
us to also derive regret bounds which are formally stronger (see Section 2.2). We stress that the main novelty of this
paper is not in any specific optimization algorithm, but in the introduction of a new kernel, and our guarantees rely solely
on it.

Finally, note that $\phi_1$ induces the same scalar product as a 0-imputation. In that respect, by considering different
$\gamma = 1, 2, \ldots$ and using a holdout set we can guarantee that our method will outperform the 0-imputation method. By
normalizing or adding a bias term we can in fact compete with mean-imputation or any other first order imputation.
2.2 Regret minimization for joint subspace learning and classification

A significant technical contribution of this manuscript is the agnostic learning of a subspace coupled with a linear
classifier. A subspace is represented by a projection matrix $Q \in M_{d \times d}$, which satisfies $Q^2 = Q$. Denote the following
class of target functions

\[ \mathcal{F}_0 = \{ f_{w,Q} : w \in B_d,\ Q \in M_{d \times d},\ Q^2 = Q \} \]

where $f_{w,Q}(x_o)$ is the linear predictor defined by $w$ over the subspace defined by the matrix $Q$, as formally defined in
Definition 2.

Given the aforementioned efficient kernel mapping $\phi_\gamma$, we consider the following kernel-gradient-based online
algorithm for classification called KARMA (Kernelized Algorithm for Risk-minimization with Missing Attributes).


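Algorithm 1 KARMA: Kernelized Algorithm for Risk-minimization with Missing Attributes

1: Input: parameters $\gamma > 1$, $\{\eta_t > 0\}$, $\rho > 0$, $B > 0$
2: for $t = 1$ to $T$ do
3:   Observe example $(x^{(t)}_{o_t}, y_t)$, suffer loss $\ell(v_t^\top \phi_\gamma(x^{(t)}_{o_t}), y_t)$
4:   Update ($\ell'$ denotes the derivative w.r.t. the first argument)

\[
\alpha^{(t+1)}_i = \begin{cases}
(1 - \eta_t \rho) \cdot \alpha^{(t)}_i & i < t \\
-\eta_t \, \ell'(v_t^\top \phi_\gamma(x^{(t)}_{o_t}), y_t) & i = t \\
0 & \text{else}
\end{cases}
\]

\[
v_{t+1} = \sum_{i=1}^{t} \alpha^{(t+1)}_i \phi_\gamma(x^{(i)}_{o_i})
\]

5: end for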
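The update can be carried out purely through kernel evaluations, since $v_t$ is only ever used through $v_t^\top \phi_\gamma(\cdot)$. The following is a minimal Python sketch of such a loop under several assumptions that are ours rather than the paper's: examples are dicts from observed index to value, the step size is $\eta_t = 1/(\rho t)$, the loss is the hinge loss with a subgradient in place of $\ell'$, and any projection step involving $B$ is omitted. The names `kernel`, `hinge_grad` and `karma` are hypothetical.

```python
def kernel(x1, x2, gamma):
    """k_gamma on dict-encoded partially observed vectors."""
    common = x1.keys() & x2.keys()
    n = len(common)
    factor = sum(n ** j for j in range(gamma))  # (n^gamma - 1) / (n - 1)
    return factor * sum(x1[i] * x2[i] for i in common)

def hinge_grad(pred, y):
    """Subgradient of max(0, 1 - y * pred) with respect to pred."""
    return -y if y * pred < 1 else 0.0

def karma(stream, gamma, rho, loss_grad=hinge_grad):
    """Online kernelized gradient loop; returns predictions and coefficients."""
    alphas, seen, preds = [], [], []
    for t, (x, y) in enumerate(stream, start=1):
        # v_t . phi(x) expressed through the kernel (representer form).
        pred = sum(a * kernel(xi, x, gamma) for a, xi in zip(alphas, seen))
        preds.append(pred)
        eta = 1.0 / (rho * t)                    # assumed step-size schedule
        alphas = [(1 - eta * rho) * a for a in alphas]
        alphas.append(-eta * loss_grad(pred, y))
        seen.append(x)
    return preds, alphas

stream = [({0: 1.0}, 1), ({0: 1.0}, 1)]
preds, alphas = karma(stream, gamma=1, rho=1.0)
print(preds, alphas)
```

Each round costs one kernel evaluation per stored example, so the total running time over $T$ rounds is quadratic in $T$ in this naive form.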




Our main result for the fully adversarial online setting is given next, and proved in the Appendix. Notice that the 
subspace E* and associated projection matrix Q* are chosen by an adversary and unknown to the algorithm. 

Theorem 3. For any $\gamma > 1$, $\lambda > 0$, $X > 0$, $\rho > 0$, $B > 0$, $L$-Lipschitz convex loss function $\ell$, and $\lambda$-regular sequence
$\{(x^{(t)}, o_t, y_t)\}$ w.r.t. subspace $E^*$ and associated projection matrix $Q^*$ such that $\|x^{(t)}\|_\infty \le X$, run Algorithm 1 with
$\{\eta_t = \frac{1}{\rho t}\}$; it sequentially outputs $\{v_t \in \mathbb{R}^{\Gamma}\}$ such that


or 2 ^ 2 p „ p — ^1 

- min < - (1 -f logT) -k ■ B + —-LT 

J l|wi|<l“ P I A 

In particular, taking p = 7 = ^ log T we obtain 

- niin = 0{XLVTBT) 

t t 

3 Preliminaries and Notations 

3.1 Notations 

As discussed, we consider a model where a distribution $D$ is fixed over $\mathbb{R}^d \times \mathcal{O} \times \mathcal{Y}$, where $\mathcal{O} = 2^d$ consists of all
subsets of $\{1,\dots,d\}$. We will generally denote elements of $\mathbb{R}^d$ by $x, w, v, u$ and elements of $\mathcal{O}$ by $o$. We denote by
$B_d$ the unit ball of $\mathbb{R}^d$ and by $B_d(r)$ the ball of radius $\sqrt{r}$.

Given a subset $o$ we denote by $P_o \in \mathbb{R}^{|o| \times d}$ the projection onto the indices in $o$, i.e., if $i_1 < i_2 < \cdots < i_k$ are
the elements of $o$ in increasing order then $(P_o x)_j = x_{i_j}$. Given a matrix $A$ and a set of indices $o$, we let

\[ A_{o,o} = P_o A P_o^\top . \]
3.2 Model Assumptions

Definition 1 ($\lambda$-regularity). We say that $D$ is $\lambda$-regular with associated subspace $E$ if the following happens with
probability 1 (w.r.t. the joint random variables $(x, o)$):

1. $\|P_o x\| \le 1$.

2. $x \in E$.

3. $\ker(P_o P_E) = \ker(P_E)$.

4. If $\lambda_o > 0$ is a strictly positive singular value of the matrix $P_o P_E$ then $\lambda_o \ge \lambda$.

Assumption 1 (Low Rank Assumption). We say that $D$ satisfies the low rank assumption with associated subspace
$E$ if it is $\lambda$-regular with associated subspace $E$ for some $\lambda > 0$.
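For a known subspace and a given observation pattern, items 3 and 4 of Definition 1 can be checked numerically. The following is a minimal sketch using NumPy, assuming $E$ is given by a matrix with orthonormal columns; the function name `regularity` and the tolerance are illustrative choices. It uses the fact that $\ker(P_E) \subseteq \ker(P_o P_E)$ always holds, so the two kernels coincide exactly when the ranks agree.

```python
import numpy as np

def regularity(basis, o, tol=1e-10):
    """Check items 3 and 4 of the regularity definition for one pattern o.

    basis: d x r array with orthonormal columns spanning E (assumed input).
    o: list of observed indices.
    Returns (kernel_condition_holds, smallest positive singular value of P_o P_E).
    """
    P_E = basis @ basis.T
    M = P_E[o, :]                                  # the matrix P_o P_E
    s = np.linalg.svd(M, compute_uv=False)
    positive = s[s > tol]
    kernel_ok = len(positive) == basis.shape[1]    # rank(P_o P_E) == rank(P_E)
    lam_o = float(positive.min()) if len(positive) else 0.0
    return kernel_ok, lam_o

# E spanned by (1, 1, 0) / sqrt(2) inside R^3.
basis = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2)
print(regularity(basis, [0]))   # first coordinate determines x: regular
print(regularity(basis, [2]))   # third coordinate carries no information
```

In the learning setting the subspace is hidden, so this check is a diagnostic for synthetic experiments rather than something the learner can run.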

Note that in our setting we assume that $\|P_o x\| \le 1$ a.s. If $\|x\| \le 1$ then $\|P_o x\| \le 1$, hence our assumption is
weaker than assuming $x$ is contained in a fixed sized ball. Further, the assumption can be verified on a sample set with
missing attributes.

Note also that we've normalized both $w$ and $x_o$. To achieve guarantees that scale with $\|w\|$, note that we can
replace the loss function $\ell(w \cdot x, y)$ with $\ell(\rho \cdot w \cdot x, y)$ for any constant $\rho$. This will replace $L$-Lipschitzness with
$\rho \cdot L$-Lipschitzness in all results.

4 Learning under low rank assumption and $\lambda$-regularity

Definition 2 (The class $\mathcal{F}_0$). We define the following class of target functions

\[ \mathcal{F}_0 = \{ f_{w,Q} : w \in B_d(1),\ Q \in M_{d \times d},\ Q^2 = Q \} \]

where

\[ f_{w,Q}(x_o) = (P_o w) \cdot Q_{o,o}^{\dagger} \cdot (P_o x). \]

(Here $M^{\dagger}$ denotes the pseudo inverse of $M$.)

The following Lemma states that, under the low rank assumption, the problem of linear learning with missing data
is reduced to the problem of learning the class $\mathcal{F}_0$, in the sense that the hypothesis class $\mathcal{F}_0$ is not less expressive.

Lemma 1. Let $D$ be a distribution that satisfies the low rank assumption. For every $w^* \in B_d(1)$ there is $f^*_{w,Q} \in \mathcal{F}_0$
such that a.s.:

\[ f^*_{w,Q}(x_o) = w^* \cdot x. \]

In particular $Q = P_E$ and $w = P_E w^*$, where $P_E$ is the projection matrix on the subspace $E$.
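Lemma 1 can be sanity-checked on a small hand-built instance. The following is a minimal NumPy sketch; the subspace, the vectors and the helper name `f_wQ` are illustrative assumptions. Here $E \subset \mathbb{R}^4$ is spanned by $(1,0,1,0)/\sqrt{2}$ and $(0,1,0,1)/\sqrt{2}$, so observing the first two coordinates already determines $x \in E$.

```python
import numpy as np

def f_wQ(w, Q, x, o):
    """(P_o w) . pinv(Q_oo) . (P_o x); only the coordinates of x in o are read."""
    Q_oo = Q[np.ix_(o, o)]
    # An explicit cutoff keeps numerically-zero singular values out of the inverse.
    return w[o] @ np.linalg.pinv(Q_oo, rcond=1e-10) @ x[o]

basis = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)
Q = basis @ basis.T                      # orthogonal projection onto E
x = np.array([1.0, 2.0, 1.0, 2.0])       # a point of E
w_star = np.array([1.0, 0.0, 0.0, 3.0])  # an arbitrary full-data predictor
w = Q @ w_star                           # the choice w = P_E w* from the lemma

print(w_star @ x)                        # prediction with full data
print(f_wQ(w, Q, x, [0, 1]))             # prediction from two observed entries
print(f_wQ(w, Q, x, [0, 1, 2]))          # here Q_oo is singular, pinv is needed
```

All three printed values agree, illustrating that the predictor over missing data reproduces $w^* \cdot x$ whenever the observation pattern is regular for $E$.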

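4.1 Approximating $\mathcal{F}_0$ under regularity

We next define a surrogate class of target functions that approximates $\mathcal{F}_0$.

Definition 3 (The classes $\mathcal{F}^\gamma$). For every $\gamma$ we define the following class

\[ \mathcal{F}^\gamma = \{ f^\gamma_{w,Q} : w \in B_d(1),\ Q \in M_{d \times d},\ Q^2 = Q \} \]

where

\[ f^\gamma_{w,Q}(x_o) = (P_o w) \cdot \sum_{j=0}^{\gamma - 1} (Q_{o,o})^j \cdot (P_o x). \]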
Lemma 2. Let $(x, o)$ be a sample drawn according to a $\lambda$-regular distribution $D$ with associated subspace $E$. Let
$Q = P_E$ and $\|w\| \le 1$, then a.s.:

\[ \| f^\gamma_{w,I-Q}(x_o) - f_{w,Q}(x_o) \| \le \frac{(1-\lambda)^\gamma}{\lambda}. \]

Corollary 1. Let $\ell$ be an $L$-Lipschitz function. Under $\lambda$-regularity, for every $\gamma \ge \frac{\log (L/\lambda\epsilon)}{\lambda}$ the class $\mathcal{F}^\gamma$ contains an
$\epsilon$-good target function.
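The approximation in Lemma 2 is a truncated Neumann series for the pseudo inverse, and its geometric convergence is easy to observe. The following is a minimal NumPy sketch on a hand-built instance; the subspace, vectors and the name `f_gamma` are illustrative assumptions, and `lam` below is taken to be the smallest positive eigenvalue of $Q_{o,o}$ for this instance. Because the example vectors are not normalized, the bound is scaled by $\|P_o w\| \, \|P_o x\|$.

```python
import numpy as np

def f_gamma(w, A, x, o, gamma):
    """(P_o w) . sum_{j < gamma} (A_oo)^j . (P_o x) for a d x d matrix A."""
    A_oo = A[np.ix_(o, o)]
    vec, total = x[o].copy(), 0.0
    for _ in range(gamma):
        total += w[o] @ vec
        vec = A_oo @ vec
    return total

basis = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)
Q = basis @ basis.T
x = np.array([1.0, 2.0, 1.0, 2.0])
w = Q @ np.array([1.0, 0.0, 0.0, 3.0])
o = [0, 1, 2]

exact = w[o] @ np.linalg.pinv(Q[np.ix_(o, o)], rcond=1e-10) @ x[o]
lam = 0.5   # smallest positive eigenvalue of Q_oo in this example
for gamma in (1, 5, 10):
    approx = f_gamma(w, np.eye(4) - Q, x, o, gamma)
    bound = (1 - lam) ** gamma / lam * np.linalg.norm(w[o]) * np.linalg.norm(x[o])
    print(gamma, abs(approx - exact), bound)
```

The error shrinks by the factor $(1 - \lambda)$ per additional term, which is why $\gamma$ only needs to grow logarithmically in $1/\epsilon$.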


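4.2 Improper learning of $\mathcal{F}^\gamma$ and a kernel trick

Let $C$ be the set of all finite, non-empty sequences of length at most $\gamma$ over $\{1,\dots,d\}$. For each $s \in C$ denote by $|s|$ the length
of the sequence and by $s_{\mathrm{end}}$ the last element of the sequence. Given a set of observations $o$ we write $s \subseteq o$ if all elements
of the sequence $s$ belong to $o$. We let

\[ \Gamma = \sum_{j=1}^{\gamma} d^j = |C| = \frac{d^{\gamma+1} - d}{d - 1} \]

and we index the coordinates of $\mathbb{R}^{\Gamma}$ by the elements of $C$:

Definition 4. We let $\phi_\gamma : (\mathbb{R}^d \times \mathcal{O}) \to \mathbb{R}^{\Gamma}$ be the embedding:

\[
(\phi_\gamma(x_o))_s = \begin{cases} x_{s_{\mathrm{end}}} & s \subseteq o \\ 0 & \text{else} \end{cases}
\]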

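Lemma 3. For every $Q$ and $w$ we have:

\[
f^\gamma_{w,Q}(x_o) = \sum_{s_1 \in o} w_{s_1} x_{s_1} + \sum_{\{s \,:\, s \subseteq o,\ 2 \le |s| \le \gamma\}} w_{s_1} \cdot Q_{s_1,s_2} \cdot Q_{s_2,s_3} \cdots Q_{s_{|s|-1},s_{\mathrm{end}}} \cdot x_{s_{\mathrm{end}}}
\]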


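Corollary 2. For every $f^\gamma_{w,Q} \in \mathcal{F}^\gamma$ there is $v \in B_\Gamma(\Gamma)$, such that:

\[ f^\gamma_{w,Q}(x_o) = v \cdot \phi_\gamma(x_o). \]

As a corollary, for every loss function $\ell$ and distribution $D$ we have that:

\[ \min_{v \in B_\Gamma(\Gamma)} \mathbb{E}[\ell(v \cdot \phi_\gamma(x_o), y)] \le \min_{f^\gamma_{w,Q} \in \mathcal{F}^\gamma} \mathbb{E}[\ell(f^\gamma_{w,Q}(x_o), y)] \]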
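The path expansion behind Lemma 3 and Corollary 2 can be verified by brute force on a toy instance. The following is a minimal pure-Python sketch; the integer matrix `Q`, the vectors and the function names are illustrative assumptions (the identity holds for any matrix, so `Q` need not be a projection here). It compares the matrix-power form of $f^\gamma_{w,Q}$ with the sum over index sequences, where each sequence $s$ contributes $v_s \cdot x_{s_{\mathrm{end}}}$ with $v_s = w_{s_1} Q_{s_1,s_2} \cdots Q_{s_{|s|-1},s_{\mathrm{end}}}$.

```python
from itertools import product

def f_gamma(w, Q, x, o, gamma):
    """Matrix-power form: (P_o w) . sum_{j < gamma} (Q_oo)^j . (P_o x)."""
    wo = [w[i] for i in o]
    vec = [x[i] for i in o]              # holds (Q_oo)^j P_o x, starting at j = 0
    total = 0
    for _ in range(gamma):
        total += sum(a * b for a, b in zip(wo, vec))
        vec = [sum(Q[i][k] * vec[c] for c, k in enumerate(o)) for i in o]
    return total

def v_dot_phi(w, Q, x, o, gamma):
    """Path form: sum over sequences s within o of v_s * x_{s_end}."""
    total = 0
    for length in range(1, gamma + 1):
        for s in product(o, repeat=length):
            v_s = w[s[0]]
            for a, b in zip(s, s[1:]):
                v_s *= Q[a][b]
            total += v_s * x[s[-1]]
    return total

Q = [[1, 2, 0], [0, 1, 1], [1, 0, 1]]
w, x, o = [1, -1, 2], [2, 1, 3], [0, 2]
print(f_gamma(w, Q, x, o, 3), v_dot_phi(w, Q, x, o, 3))
```

The two evaluations agree, which is exactly the statement that $f^\gamma_{w,Q}$ is linear in the embedded representation.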

Due to Corollary 2, learning $\mathcal{F}^\gamma$ can be improperly done via learning a linear classifier over the embedded sample
set $\{\phi_\gamma(x^{(i)}_{o_i})\}_{i=1}^m$. While the ambient space may be very large, the computational complexity of the next optimization
scheme actually depends only on the scalar product between the embedded samples. For that we give the following
result that shows that the scalar product can be computed efficiently:

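Theorem 4.

\[
\phi_\gamma(x^{(1)}_{o_1}) \cdot \phi_\gamma(x^{(2)}_{o_2}) = \frac{|o_1 \cap o_2|^{\gamma} - 1}{|o_1 \cap o_2| - 1} \sum_{k \in o_1 \cap o_2} x^{(1)}_k x^{(2)}_k
\]

(We use the convention that $\frac{1^\gamma - 1}{1 - 1} = \lim_{x \to 1} \frac{x^\gamma - 1}{x - 1} = \gamma$.)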
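For small $d$ and $\gamma$ the embedding can be built explicitly and compared against the closed form. The following is a minimal pure-Python sketch, assuming partially observed vectors are stored as dicts from observed index to value; the names `embed`, `explicit_dot` and `closed_form` are illustrative.

```python
from itertools import product

def embed(x_obs, d, gamma):
    """Explicit phi_gamma as a sparse dict keyed by index sequences.

    A coordinate s is non-zero only if every element of s is observed,
    in which case it equals the value at the last index of s.
    """
    phi = {}
    for length in range(1, gamma + 1):
        for s in product(range(d), repeat=length):
            if all(i in x_obs for i in s):
                phi[s] = x_obs[s[-1]]
    return phi

def explicit_dot(x1, x2, d, gamma):
    p1, p2 = embed(x1, d, gamma), embed(x2, d, gamma)
    return sum(p1[s] * p2[s] for s in p1.keys() & p2.keys())

def closed_form(x1, x2, gamma):
    common = x1.keys() & x2.keys()
    n = len(common)
    factor = gamma if n == 1 else (n ** gamma - 1) // (n - 1)
    return factor * sum(x1[k] * x2[k] for k in common)

d, gamma = 4, 3
x1 = {0: 1.0, 1: 2.0, 3: -1.0}
x2 = {1: 0.5, 2: 3.0, 3: 2.0}
print(explicit_dot(x1, x2, d, gamma), closed_form(x1, x2, gamma))
```

The explicit route enumerates up to $d + d^2 + \dots + d^\gamma$ coordinates, while the closed form touches only the commonly observed entries.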












5 Discussion and future work 


We have described the first theoretically-sound method to cope with low rank missing data, giving rise to a classification
algorithm that attains competitive error to that of the optimal linear classifier that has access to all data. Our
non-proper agnostic framework for learning a hidden low-rank subspace comes with provable guarantees, whereas
heuristics based on separate data reconstruction and classification are shown to fail for certain scenarios.

Our technique is directly applicable to classification with low rank missing data and polynomial kernels via kernel 
(polynomial) composition. General kernels can be handled by polynomial approximation, but it is interesting to think 
about a more direct approach. 

It is possible to derive all our results for a less stringent condition than $\lambda$-regularity: instead of bounding the
smallest eigenvalue of the hidden subspace, it is possible to bound only the ratio of largest-to-smallest eigenvalue.
This results in better bounds in a straightforward plug-and-play into our analysis, but was omitted for simplicity.
A Proofs of theorems and lemmas from main text

A.1 Technical Claims

Claim 1. Let $Q \in M_{d \times d}$ be a square projection matrix and $P \in M_{k \times d}$ a matrix. Recall that:

\[ \mathrm{Im}(A) = \{v : \exists u,\ Au = v\}, \quad \text{and} \quad \ker(A) = \{v : Av = 0\}. \]

And that $\mathrm{rank}(A)$ is the size of the largest collection of linearly independent columns of $A$.

The following statements are equivalent:

1. $\ker(PQ) = \ker(Q)$.

2. $\mathrm{rank}(PQ) = \mathrm{rank}(QP^\top) = \mathrm{rank}(PQP^\top) = \mathrm{rank}(Q)$.

3. $\mathrm{Im}(QP^\top) = \mathrm{Im}(Q)$.

Proof.

$1 \Rightarrow 2$: Clearly $\mathrm{rank}(PQ) \le \mathrm{rank}(Q)$. If $\mathrm{rank}(PQ) < \mathrm{rank}(Q)$ we must have some collection of linearly independent
columns of $Q$ that are linearly dependent in $PQ$; this implies that there is $v$ such that $PQv = 0$ but $Qv \ne 0$.
Hence $\ker(PQ) \ne \ker(Q)$, which is a contradiction; we conclude that $\mathrm{rank}(PQ) = \mathrm{rank}(Q)$.

That $\mathrm{rank}(PQ) = \mathrm{rank}(QP^\top) = \mathrm{rank}(PQP^\top)$ follows from the fact that $\mathrm{rank}(A) = \mathrm{rank}(A^\top) = \mathrm{rank}(AA^\top)$
and using the fact that $QQ^\top = Q$ since $Q$ is a projection matrix.

$2 \Rightarrow 3$: We have that $\mathrm{Im}(QP^\top) \subseteq \mathrm{Im}(Q)$. The two subspaces, $\mathrm{Im}(QP^\top)$ and $\mathrm{Im}(Q)$, are in fact the linear span of the
columns of $QP^\top$ and $Q$ respectively.

Since $\mathrm{rank}(QP^\top) = \mathrm{rank}(Q)$ we conclude that the dimension of the two subspaces is equal. It follows that
$\mathrm{Im}(QP^\top) = \mathrm{Im}(Q)$.

$3 \Rightarrow 1$: Since $\mathrm{Im}(QP^\top) = \mathrm{Im}(Q)$ we also have $\mathrm{rank}(QP^\top) = \mathrm{rank}(Q)$ and as a corollary $\mathrm{rank}(PQ) = \mathrm{rank}(Q)$.

Now by the rank-nullity Theorem, for every $A \in M_{k \times d}$, $\dim(\ker(A)) = d - \mathrm{rank}(A)$.

Hence $\dim(\ker(PQ)) = \dim(\ker(Q))$. Since $\ker(Q) \subseteq \ker(PQ)$ we must have $\ker(PQ) = \ker(Q)$.
□

Claim 2. Let $o \in 2^d$ be drawn according to a distribution $D$ that satisfies the low rank assumption. If $Q = P_E$ then:

\[ \mathrm{Im}(Q_{o,o}) = \mathrm{Im}(P_o Q) \]

Proof. $\ker(P_o Q) = \ker(Q)$ holds by assumption (item 3 in Definition 1). $\mathrm{Im}(Q) = \mathrm{Im}(Q P_o^\top)$ then follows
from Claim 1 ($1 \Rightarrow 3$). In particular $\mathrm{Im}(P_o Q) = \mathrm{Im}(P_o Q P_o^\top) = \mathrm{Im}(Q_{o,o})$. □

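A.2 Proof of Lemma 1

By definition, if $P_o x \in \mathrm{Im}(Q_{o,o})$ then $Q_{o,o} (Q_{o,o})^{\dagger} P_o x = P_o x$. We claim that due to the low rank assumption,
$P_o x \in \mathrm{Im}(Q_{o,o})$.

Indeed, recall that $Q = P_E$ and $x \in E$, hence $Qx = x$ and $P_o x \in \mathrm{Im}(P_o Q)$. By Claim 2 we have $\mathrm{Im}(Q_{o,o}) =
\mathrm{Im}(P_o Q)$, hence $P_o x \in \mathrm{Im}(Q_{o,o})$.
Next, we have that

\[ P_o Q P_o^\top (Q_{o,o})^{\dagger} P_o x = Q_{o,o} (Q_{o,o})^{\dagger} P_o x = P_o x. \]

Alternatively

\[ P_o \left( Q P_o^\top Q_{o,o}^{\dagger} P_o x - x \right) = 0. \tag{4} \]

Again, since $Qx = x$ we have that:

\[ P_o Q \left( P_o^\top Q_{o,o}^{\dagger} P_o x - x \right) = 0. \tag{5} \]

The low rank assumption implies that $P_o Q v = 0$ if and only if $Qv = 0$. Apply this to $v = P_o^\top Q_{o,o}^{\dagger} P_o x - x$ and get:

\[ Q P_o^\top Q_{o,o}^{\dagger} P_o x = Qx = x. \]

Finally we have that

\[ f_{w,Q}(x_o) = (P_o Q w^*) \cdot Q_{o,o}^{\dagger} P_o x = w^* \cdot Q P_o^\top Q_{o,o}^{\dagger} P_o x = w^* \cdot x. \]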


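A.3 Proof of Lemma 2

Let $I$ denote the identity matrix in $M_{d \times d}$. First note that $(I_{o,o} - Q_{o,o}) = (I - Q)_{o,o}$ and that $I_{o,o}$ is the identity
matrix in $M_{|o| \times |o|}$.

Let $v_1, \ldots, v_k$ be the normalized and orthogonal eigenvectors of $Q_{o,o}$ with strictly positive eigenvalues $\lambda_1 \ge
\ldots \ge \lambda_k$. By $\lambda$-regularity we have that $\lambda_k \ge \lambda$, and since the spectral norm of $Q_{o,o}$ is smaller than the spectral norm of
$Q$ we have that $\lambda_1 \le 1$.

Note that for every $v_j$ we have $Q_{o,o}^{\dagger} v_j = \frac{1}{\lambda_j} v_j$. Next, recall that $Q = P_E$ and $x \in E$, hence $Qx = x$ and
$P_o x \in \mathrm{Im}(P_o Q)$. By Claim 2 we have $P_o x \in \mathrm{Im}(Q_{o,o})$. Since $\mathrm{Im}(Q_{o,o}) = \mathrm{span}(v_1, \ldots, v_k)$, we may write
$P_o x = \sum_i \alpha_i v_i$. Since $\|P_o x\| \le 1$ and $\{v_1, \ldots, v_k\}$ is an orthonormal system we have $\sum_i \alpha_i^2 \le 1$.

Hence

\[
\left\| \left( \sum_{j=0}^{\gamma-1} (I_{o,o} - Q_{o,o})^j - Q_{o,o}^{\dagger} \right) P_o x \right\|
= \left\| \sum_i \alpha_i \left( \sum_{j=0}^{\gamma-1} (1 - \lambda_i)^j - \frac{1}{\lambda_i} \right) v_i \right\|
\le \max_i \left| \frac{1 - (1 - \lambda_i)^{\gamma}}{\lambda_i} - \frac{1}{\lambda_i} \right|
\le \max_i \frac{(1 - \lambda_i)^{\gamma}}{\lambda_i}
\le \frac{(1 - \lambda)^{\gamma}}{\lambda}.
\]

Finally since $\|P_o w\| \le 1$ we get that

\[ \| f^\gamma_{w,I-Q}(x_o) - f_{w,Q}(x_o) \| \le \frac{(1-\lambda)^\gamma}{\lambda}. \]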


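A.4 Proof of Lemma 3

Let $o_1 < o_2 < \ldots < o_{|o|}$ be the elements of $o$ ordered in increasing order. First by definition we have that:

\[
f^\gamma_{w,Q}(x_o) = \sum_{j=0}^{\gamma-1} \sum_{n,k=1}^{|o|} w_{o_n} \left( (Q_{o,o})^j \right)_{n,k} x_{o_k}
= \sum_{i \in o} w_i x_i + \sum_{j=1}^{\gamma-1} \sum_{n,k=1}^{|o|} w_{o_n} \left( (Q_{o,o})^j \right)_{n,k} x_{o_k} \tag{6}
\]

We also have by definition that for $j \ge 1$:

\[
\left( (Q_{o,o})^j \right)_{n,k} = \sum_{s=1}^{|o|} \left( (Q_{o,o})^{j-1} \right)_{n,s} \left( Q_{o,o} \right)_{s,k} = \sum_{s=1}^{|o|} \left( (Q_{o,o})^{j-1} \right)_{n,s} Q_{o_s,o_k}
\]

By induction we can show that:

\[
\left( (Q_{o,o})^j \right)_{n,k} = \sum_{s_1, \ldots, s_{j-1} \in o} Q_{o_n,s_1} \cdot Q_{s_1,s_2} \cdots Q_{s_{j-1},o_k}
\]

Reordering the elements we get for $j \ge 1$:

\[
\left( (Q_{o,o})^j \right)_{n,k} = \sum_{\{s \,:\, s \subseteq o,\ |s| = j+1,\ s_1 = o_n,\ s_{j+1} = o_k\}} Q_{s_1,s_2} \cdot Q_{s_2,s_3} \cdots Q_{s_j,s_{j+1}} \tag{7}
\]

The result now follows from Eq. (6) and Eq. (7) by a change of indexes.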



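A.5 Proof of Corollary 2

Choose

\[
v_s = \begin{cases}
w_{s_1} & |s| = 1 \\
w_{s_1} \cdot Q_{s_1,s_2} \cdot Q_{s_2,s_3} \cdots Q_{s_{|s|-1},s_{\mathrm{end}}} & |s| \ge 2
\end{cases}
\]

It is clear from Lemma 3 that $f^\gamma_{w,Q}(x_o) = v \cdot \phi_\gamma(x_o)$. We only need to show that $\|v\| \le \sqrt{\Gamma} \|w\|$.
Note that since $Q^2 = Q$ we have $\max(|Q_{i,j}|) \le 1$. Hence $|v_s| \le |w_{s_1}|$ and:

\[
\|v\|^2 = \sum_{s \in C} v_s^2 \le \sum_{s \in C} w_{s_1}^2 \le \Gamma \|w\|^2 .
\]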


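A.6 Proof of Theorem 4

By definition of $\phi_\gamma$ we have:

\[
\begin{aligned}
\phi_\gamma(x^{(1)}_{o_1}) \cdot \phi_\gamma(x^{(2)}_{o_2})
&= \sum_{s \subseteq o_1 \cap o_2} x^{(1)}_{s_{\mathrm{end}}} x^{(2)}_{s_{\mathrm{end}}} \\
&= \sum_{l=1}^{\gamma} \sum_{k \in o_1 \cap o_2} \ \sum_{s \subseteq o_1 \cap o_2,\ s_{\mathrm{end}} = k,\ |s| = l} x^{(1)}_k x^{(2)}_k \\
&= \sum_{l=1}^{\gamma} \ \sum_{|s| = l-1,\ s \subseteq o_1 \cap o_2} \ \sum_{k \in o_1 \cap o_2} x^{(1)}_k x^{(2)}_k \\
&= \sum_{l=1}^{\gamma} |o_1 \cap o_2|^{l-1} \sum_{k \in o_1 \cap o_2} x^{(1)}_k x^{(2)}_k \\
&= \frac{1 - |o_1 \cap o_2|^{\gamma}}{1 - |o_1 \cap o_2|} \sum_{k \in o_1 \cap o_2} x^{(1)}_k x^{(2)}_k .
\end{aligned}
\]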

A.7 Proof of Theorem |2] 


We take (j)^ as in Definition|^ That = \J°„u’)n°o( 2 j|' E^goCDho 


( 2 ) • xp^ is shown in Theorem 


a 


The analysis of sub-gradient descent methods to optimize problems of this form i.e: 

C 


m 


was studied in Shalev-Shwartz et al. (2011 1 and the detailed analysis can be found there (with generalization to mercer 
kernels and general losses). We mention that since I is L-Lipschitz and ||(^.y(xo)|| < vT a bound R on the gradient 
of V£(vT(/).y(Xo),y) = f(vT^^(xo),t/)(/>^(xo) is given by L^T. 

This establishes itemsaand|2l 

Next we let i be an L-Lipschitz loss function and D a A-regular distribution and we assume that 7 > 

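Due to Corollary 1, for some $v^* \in B_\Gamma(r)$:

$$E[\ell(v^* \cdot \phi_\gamma(x_o), y)] \le \min_{w,Q} E[\ell(f^{\gamma}_{w,Q}(x_o), y)]$$

Applying Lemma 2 and $L$-Lipschitzness, for every $f_{w,Q} \in \mathcal{F}_0$ we have:

$$E[\ell(v^* \cdot \phi_\gamma(x_o), y)] \le E[\ell(f_{w,Q}(x_o), y)] + L\,\frac{(1-\lambda)^{\gamma}}{\lambda}$$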
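The result follows from the choice of $\gamma$ and the inequality $-\lambda \ge \log(1-\lambda)$:

$$\frac{(1-\lambda)^{\gamma}}{\lambda} \le \frac{(1-\lambda)^{\log(2L/(\lambda\epsilon))/\lambda}}{\lambda} \le \frac{e^{-\log(2L/(\lambda\epsilon))}}{\lambda} = \frac{\epsilon}{2L}$$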
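The role of the truncation depth can be illustrated numerically. The sketch below rests on two assumptions that are not legible in this section: that the truncation error is bounded by $(1-\lambda)^{\gamma}/\lambda$, and that $\gamma$ is chosen as $\lceil \log(2L/(\lambda\epsilon))/\lambda \rceil$. The function names are hypothetical.

```python
import math


def truncation_error_bound(lam, gamma):
    """Assumed form of the truncation bound: (1 - lam)^gamma / lam."""
    return (1.0 - lam) ** gamma / lam


def gamma_for_accuracy(lam, eps, L):
    """Smallest integer depth satisfying gamma >= log(2L / (lam * eps)) / lam (assumed choice)."""
    return max(1, math.ceil(math.log(2.0 * L / (lam * eps)) / lam))


for lam in (0.05, 0.2, 0.5):
    gamma = gamma_for_accuracy(lam, eps=0.01, L=1.0)
    print(lam, gamma, truncation_error_bound(lam, gamma))
```

Under these assumptions the required depth grows like $1/\lambda$ up to a logarithmic factor, so a small regularity parameter $\lambda$ calls for a deeper truncation.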


A.8 Proof of Theorem 1

Fix a sample S = {x^J™ ^ and 7 > Let 

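$$\mathcal{L}(v) = E\big(\ell(v^\top \phi_\gamma(x_o), y)\big), \qquad \hat{\mathcal{L}}(v) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(v^\top \phi_\gamma(x^{(i)}_{o_i}), y_i\big)$$

be the expected and empirical losses of the vector $v$.

Further denote by

$$F_C(v) = \frac{1}{2C}\|v\|^2 + \mathcal{L}(v), \qquad \hat{F}_C(v) = \frac{1}{2C}\|v\|^2 + \hat{\mathcal{L}}(v)$$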


Let C{m) e O ■ R™ Alg. 


with T = m and let v = Theorem 


Item 


we get: 


Tb(m)(v) < minFc(m)(v) + 


(C{m)L'^V{t) 
Note that $\|\phi_\gamma(x_o)\| \le \sqrt{\Gamma(\epsilon)}$. We now apply Corollary 4 in Sridharan et al. (2009) with $B = \sqrt{\Gamma(\epsilon)}$ to obtain the following bound (with probability $1 - \delta$) for every $w$:

In particular for every ||w li < vm we have 

From Theorem 1^ itemwe have that for some ||w II 

£{w)< min E(f(wTx,j/)+ e. 

||wit<r 

The result now follows from the choice of m. 


Sridharan et al. 


(2009 1 with 


A.9 Proof of Theorem 3


Before proving the theorem, we formally define the sequences for which the algorithm applies: a $\lambda$-regular sequence is one such that the uniform distribution over the sequence elements is $\lambda$-regular with associated subspace $E$.


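Proof of Theorem 3. Let $E^*$ denote the adversarially chosen subspace and $Q^*$ the projection associated with it. Since the sequence $\{(x^t, o_t, y_t)\}$ is $\lambda$-regular w.r.t. the subspace $E^*$, we have by Lemma 2:

$$\forall\, \|w\| \le 1: \quad \|f^{\gamma}_{w,Q^*}(x_o) - f_{w,Q^*}(x_o)\| \le \frac{(1-\lambda)^{\gamma}}{\lambda}$$

Thus, for $f_{w^*,Q^*}$ we have

$$\begin{aligned}
\min_{\|w\| \le 1} \sum_t \ell(f^{\gamma}_{w,Q^*}(x^t_{o_t}), y_t) - \sum_t \ell(f_{w^*,Q^*}(x^t_{o_t}), y_t)
&\le \sum_t \ell(f^{\gamma}_{w^*,Q^*}(x^t_{o_t}), y_t) - \sum_t \ell(f_{w^*,Q^*}(x^t_{o_t}), y_t) \\
&\le L \sum_t \|f^{\gamma}_{w^*,Q^*}(x^t_{o_t}) - f_{w^*,Q^*}(x^t_{o_t})\| && \text{($\ell$ is $L$-Lipschitz)} \\
&\le T L\, \frac{(1-\lambda)^{\gamma}}{\lambda} && \text{(Lemma 2)}
\end{aligned}$$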


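Hence it suffices to show that

$$\sum_t \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t) - \min_{f^{\gamma}_{w,Q} \in \mathcal{F}^{\gamma}} \sum_t \ell(f^{\gamma}_{w,Q}(x^t_{o_t}), y_t) = O(\sqrt{T})$$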
Corollary 1 asserts that

$$f^{\gamma}_{w,Q}(x_o) = v \cdot \phi_\gamma(x_o)$$

Thus, the theorem statement can be further reduced to

$$\sum_t \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t) - \min_{v^* \in B_\Gamma(r)} \sum_t \ell({v^*}^\top \phi_\gamma(x^t_{o_t}), y_t) = O(\sqrt{T})$$

We proceed to prove the equation above.

Algorithm 2 applies the following update rule

t 

vt-ri = 

2=1 


( 8 ) 
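where $v_{t+1}$ can be re-written as:

$$\begin{aligned}
v_{t+1} &= (1 - \eta_t \rho)\, v_t - \eta_t\, \ell'(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\, \phi_\gamma(x^t_{o_t}) \\
&= v_t - \eta_t \nabla \tilde{\ell}_t(v_t)
\end{aligned} \qquad (9)$$

where

$$\tilde{\ell}_t(v) = \ell(v^\top \phi_\gamma(x^t_{o_t}), y_t) + \frac{\rho}{2}\|v\|^2$$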

The above implies a bound on the norm of the gradients of $\tilde{\ell}_t$, as given by the following lemma:

Lemma 4. For all iterations $t \in [T]$ we have

$$\|v_t\| \le L X \sqrt{\Gamma}, \qquad \|\nabla \tilde{\ell}_t(v_t)\| \le 2 L X \sqrt{\Gamma}$$

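Eq. (9) implies that KARMA applies the online gradient descent algorithm on the functions $\tilde{\ell}_t$, which are $\rho$-strongly-convex. Hence, the bound of Theorem 3.3 in Hazan (2014), with appropriate learning rates $\eta_t$ and with $\alpha = \rho$, $G = 2LX\sqrt{\Gamma}$ (by Lemma 4), gives

$$\sum_t \tilde{\ell}_t(v_t) - \min_{v^*} \sum_t \tilde{\ell}_t(v^*) \le \frac{2 L^2 X^2 \Gamma}{\rho}(1 + \log T)$$

This directly implies our theorem since (recall that $\|v^*\| \le B$ by assumption):

$$\begin{aligned}
\sum_t \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t) - \min_{\|v^*\| \le B} \sum_t \ell({v^*}^\top \phi_\gamma(x^t_{o_t}), y_t)
&= \sum_t \tilde{\ell}_t(v_t) - \min_{v^*} \sum_t \tilde{\ell}_t(v^*) + \frac{\rho}{2}\Big(\sum_t \|v^*\|^2 - \|v_t\|^2\Big) \\
&\le \frac{2 L^2 X^2 \Gamma}{\rho}(1 + \log T) + \frac{\rho}{2}\, T B^2
\end{aligned}$$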

□ 
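The gradient-step form of the update can be sketched on explicit feature vectors. This is a minimal illustration rather than KARMA itself: the algorithm works through the kernel, whereas the code below assumes the features $\phi_\gamma(x^t_{o_t})$ are given as plain lists. The step size $\eta_t = 1/(\rho t)$, the hinge loss, and all names are assumptions made for the example.

```python
def hinge_grad(margin, y):
    """Sub-derivative of the hinge loss max(0, 1 - y * margin) w.r.t. margin."""
    return -y if y * margin < 1.0 else 0.0


def ogd_regularized(features, labels, rho, loss_grad):
    """Online gradient descent on loss(v . phi_t, y_t) + (rho / 2) * ||v||^2.

    Uses the assumed schedule eta_t = 1 / (rho * t) and returns all iterates
    v_1, ..., v_{T+1}, starting from v_1 = 0.
    """
    v = [0.0] * len(features[0])
    iterates = [list(v)]
    for t, (phi, y) in enumerate(zip(features, labels), start=1):
        eta = 1.0 / (rho * t)
        margin = sum(a * b for a, b in zip(v, phi))
        g = loss_grad(margin, y)  # scalar derivative of the loss at the current margin
        # v_{t+1} = (1 - eta * rho) * v_t - eta * g * phi_t
        v = [(1.0 - eta * rho) * vi - eta * g * pi for vi, pi in zip(v, phi)]
        iterates.append(list(v))
    return iterates


features = [[1.0, 0.0], [0.5, -1.0], [-1.0, 2.0], [0.0, 1.0]]
labels = [1, -1, 1, -1]
iterates = ogd_regularized(features, labels, rho=0.5, loss_grad=hinge_grad)
print(iterates[-1])
```

With this particular schedule the iterate $v_{t+1}$ equals $-\frac{1}{\rho t}\sum_{i \le t} g_i\, \phi_i$, where $g_i$ is the loss derivative at round $i$, so each iterate is a scaled average of past gradient directions.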


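Proof of Lemma 4. First, notice that the norms of the gradients of the loss functions $\ell$ can be bounded by

$$\|\nabla \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\| = |\ell'(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)| \cdot \|\phi_\gamma(x^t_{o_t})\| \le L X \sqrt{\Gamma}$$

where the last inequality follows from the Lipschitz property of $\ell$ and the fact that $\phi_\gamma(x^t_{o_t})$ is a vector in $\mathbb{R}^{\Gamma}$ with coordinates from the vector $x^t$, and the bound $\|x^t\|_\infty \le X$.

Next, we prove by induction that $\|v_t\| \le L X \sqrt{\Gamma}$. For the base case we have $v_1 = 0$. Eq. (9) implies that $v_{t+1}$ is a convex combination of two vectors:

$$\begin{aligned}
\|v_{t+1}\| &= \|(1 - \eta_t \rho)\, v_t - \eta_t\, \ell'(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\, \phi_\gamma(x^t_{o_t})\| \\
&\le \max\{\rho \|v_t\|,\ \|\nabla \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\|\} \\
&\le \max\{\rho L X \sqrt{\Gamma},\ \|\nabla \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\|\} && \text{(induction hypothesis)} \\
&\le \max\{\rho L X \sqrt{\Gamma},\ L X \sqrt{\Gamma}\} && \text{(above bound on $\nabla \ell$)} \\
&\le L X \sqrt{\Gamma} && \text{($\rho \le 1$)}
\end{aligned}$$

We can now conclude the lemma by the definition of $\tilde{\ell}_t$:

$$\|\nabla \tilde{\ell}_t(v_t)\| \le \|\nabla \ell(v_t^\top \phi_\gamma(x^t_{o_t}), y_t)\| + \rho \|v_t\| \le L X \sqrt{\Gamma} + \rho L X \sqrt{\Gamma} \le 2 L X \sqrt{\Gamma}$$

□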